Lecture 3
Pandas developers: maybe fewer than 100 core, plus thousands of contributors/followers
NumPy developers: maybe fewer than 50 core, plus a bit more than 1,000 contributors/followers
Matplotlib developers: maybe fewer than 40 core, plus fewer than 1,000 contributors/followers
Seaborn developers: about one core developer and maybe 50 contributors
The name “Pandas” comes from “Panel Data,” a social science term for data that include observations over multiple time periods for the same individuals—or panels.
The core data structure in Pandas is the DataFrame, which is designed to make structured data easy to handle and analyze.
A DataFrame is just a table of data organized into rows and columns, like an R data frame, a spreadsheet, or a SQL table.
As in R, we can read CSV files (and many other formats).
pd is the conventional alias for the pandas package, and pd.read_csv() is the pandas function that reads CSV files.
The dataset can reside on the web; read_csv() accepts a URL as well as a local file path.
It is wise (but not required) to specify an index column when we have (1) time series data or (2) unique IDs.
This makes grouping, lookups, and joins faster and easier.
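A minimal sketch of reading a CSV with an index column. To stay self-contained it reads from an in-memory string holding the three AAPL rows shown in these notes; in practice the first argument would be a file path or URL. The use of `parse_dates=True` is an assumption about how the dates were parsed, not something the notes confirm.

```python
import io
import pandas as pd

# Three rows copied from the AAPL table in these notes. In practice the
# first argument to read_csv would be a file path or a URL.
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2019-01-02,38.722500,39.712502,38.557499,39.480000,37.845039,148158800
2019-01-03,35.994999,36.430000,35.500000,35.547501,34.075397,365248800
2019-01-04,36.132500,37.137501,35.950001,37.064999,35.530060,234428400
"""

# index_col makes Date the row index; parse_dates=True (an assumption here)
# converts that index to real timestamps instead of strings.
AAPL = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

# With a date index, a row can be looked up directly by its label.
first_close = AAPL.loc[pd.Timestamp("2019-01-02"), "Close"]
print(AAPL.index.name, first_close)
```

With the index set, `AAPL.loc[...]` looks rows up by date rather than by position, which is what makes time series lookups and joins convenient.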
| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2019-01-02 | 38.722500 | 39.712502 | 38.557499 | 39.480000 | 37.845039 | 148158800 |
| 2019-01-03 | 35.994999 | 36.430000 | 35.500000 | 35.547501 | 34.075397 | 365248800 |
| 2019-01-04 | 36.132500 | 37.137501 | 35.950001 | 37.064999 | 35.530060 | 234428400 |
| item_id | Unnamed: 0 | name | category | price | old_price | sellable_online | link | other_colors | short_description | designer | depth | height | width |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90420332 | 0 | FREKVENS | Bar furniture | 265.0 | No old price | True | https://www.ikea.com/sa/en/p/frekvens-bar-tabl... | No | Bar table, in/outdoor, 51x51 cm | Nicholai Wiig Hansen | NaN | 99.0 | 51.0 |
| 368814 | 1 | NORDVIKEN | Bar furniture | 995.0 | No old price | False | https://www.ikea.com/sa/en/p/nordviken-bar-tab... | No | Bar table, 140x80 cm | Francis Cayouette | NaN | 105.0 | 80.0 |
| 9333523 | 2 | NORDVIKEN / NORDVIKEN | Bar furniture | 2095.0 | No old price | False | https://www.ikea.com/sa/en/p/nordviken-nordvik... | No | Bar table and 4 bar stools | Francis Cayouette | NaN | NaN | NaN |
Python is largely object-oriented (OOP); R leans toward functional programming.
In the abstract, a Pandas DataFrame is a “class”.
The Python objects AAPL and ikea are realized Pandas DataFrames, that is, instances of the Pandas DataFrame class.
Methods are defined within a class’s definition and are associated with specific objects.
Functions can be defined independently of classes and are not necessarily associated with any objects.
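The class/instance and method/function distinction can be sketched as follows. The small DataFrame is a stand-in built from the three names and prices in the Ikea table above, not the full dataset.

```python
import pandas as pd

# Stand-in for the ikea data: names and prices taken from the table above.
ikea = pd.DataFrame(
    {
        "name": ["FREKVENS", "NORDVIKEN", "NORDVIKEN / NORDVIKEN"],
        "price": [265.0, 995.0, 2095.0],
    }
)

# ikea is an instance of the DataFrame class.
is_df = isinstance(ikea, pd.DataFrame)

# A method is defined inside a class and is called on the object itself.
mean_price = ikea["price"].mean()

# A function is defined on its own and takes the object as an argument.
n_rows = len(ikea)       # built-in Python function
missing = pd.isna(ikea)  # module-level pandas function

print(type(ikea).__name__, n_rows, mean_price)
```

The dot syntax (`object.method()`) is the visual cue for a method call, whereas a function wraps the object (`function(object)`).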
If you are using Jupyter, you don’t need plt.show(), but here (using Quarto chunks) I do, even when generating simple graphics.
import matplotlib.pyplot as plt
# Plot each daily price series against the date index
plt.plot(AAPL.index, AAPL.Open, color='purple',
         linestyle='-', linewidth=0.25, label='Open')
plt.plot(AAPL.index, AAPL.Close, color='blue', linestyle='-', linewidth=0.25, label='Close')
plt.plot(AAPL.index, AAPL.High, color='green', linestyle='-', linewidth=0.25, label='High')
plt.plot(AAPL.index, AAPL.Low, color='red', linestyle='-', linewidth=0.25, label='Low')
plt.legend()
plt.title(r'AAPL Open/Close/High/Low ', fontsize=20)
plt.xlabel('Date')
plt.ylabel('US Dollars')
plt.show()
You will always need Matplotlib
But Matplotlib is not as well suited to statistical graphics (they are not its “core” mission).
By contrast, Seaborn is all about statistical graphics.
Pandas is all about data; it has some statistical graphics capability and is much more widely supported than Seaborn.
Seaborn:
Designed for statistical graphics
Basic plots are more appealing than Pandas
Easier to facet
Has more complicated options than Pandas
Matplotlib:
Requires a lot of effort to make a simple statistical plot
But offers maximum customization
The primary foundation for all graphics in Python
Basic visual appeal not as nice as Seaborn but can be customized to be nicer (with much more work)
import matplotlib.pyplot as plt
import seaborn as sns
# Adjust figure size
plt.figure(figsize=(14, 6))
# Create a boxplot
sns.boxplot(x='category', y='price', hue='category', data = ikea)
# Set titles and labels using Matplotlib
plt.xlabel('Category')
plt.xticks(rotation = 45, fontsize = 7, ha = 'right')
plt.ylabel('Price')
plt.title('Comparison of Price Across Category')
plt.show()
Helps exploration
Comparisons made easier
Clarity (can reduce overplotting)
Not specific to one industry/field
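Faceting splits one plot into panels, one per level of a categorical variable. A minimal sketch of how this might look in Seaborn with `catplot`, using toy data: the `region` column and all values are hypothetical and are not part of the real Ikea data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical toy data: 'region' and these prices are made up for illustration.
toy = pd.DataFrame(
    {
        "category": ["Beds", "Beds", "Chairs", "Chairs"] * 2,
        "price": [100.0, 150.0, 40.0, 60.0, 120.0, 170.0, 45.0, 70.0],
        "region": ["East"] * 4 + ["West"] * 4,
    }
)

# A single call: col= creates one panel (facet) per region.
g = sns.catplot(data=toy, x="category", y="price", col="region", kind="box")
g.set_axis_labels("Category", "Price")

n_panels = g.axes.size
print(n_panels)
```

Here the facets come from the `col=` argument alone; there is no explicit loop over subsets of the data.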
In Pandas and in Matplotlib, faceting requires first creating subplots (one for each facet).
Then we process the categories facet by facet: in the Ikea example, we would filter the DataFrame for one category and then plot that category in its designated subplot.
If you were to avoid Seaborn, choose Matplotlib over Pandas
I asked ChatGPT how to facet the Pandas boxplot. Here is its response. Suppose we had regions in the Ikea data:
import matplotlib.pyplot as plt
# Unique regions to facet by
regions = ikea['region'].unique()
# Create a figure with a subplot for each region
fig, axes = plt.subplots(nrows=1, ncols=len(regions), figsize=(5 * len(regions), 5), sharey=True)
# Loop over each region and create a boxplot in the corresponding subplot
for ax, region in zip(axes, regions):
    subset = ikea[ikea['region'] == region]
    subset.boxplot(column='price', by='category', ax=ax)
    ax.set_title(f'Region: {region}')
    ax.set_xlabel('Category')
    ax.set_ylabel('Price')
    ax.tick_params(axis='x', rotation=45)  # Optional: rotate x-axis labels for clarity
# Adjust layout and display the plot
plt.tight_layout()
plt.suptitle('Price Distribution by Category and Region')  # Set the overall title
plt.subplots_adjust(top=0.85)  # Adjust the top margin to fit the suptitle
plt.show()
ggplot in Python
about 100 developers and about 200 contributors
still very new (about a year ago) with many issues (e.g., interacting with Matplotlib)